This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data available at my GitHub page.
This report will walk you through the data scientist’s workflow and how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:
To illustrate how packages released over the past few years have made these tasks easier we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium available on the PharmGKB website.
Taking a cue from David Rob1, Data Scientist at Stack Overflow, and Philip Guo2, Assistant Professor of Computer Science at University of Rochester, here is my view of the primary computational data science workflow:
The preparation phase of the workflow involves:
The analysis phase consists of:
Finally, the dissemination phase to share the results of their work:
Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps:
Let’s load up our IWPC data! We will be using a slightly modified form of the main data set, that I have manually turned into a tab delimited text file. Although there are a number of libraries to read in excel files, the non-standard column names in the data set make it easier to work with a tsv. We are going to use read.delim() as opposed to readr’s read_tsv() for two reasons:
This last reason is the deal breaker for readr. Readr interpolates the variable type (column, date, number, etc.) based on the first 100 rows or via manual specification. Given the large number of columns (22) this becomes annoying at best. However, since we can’t take advantage of readr automatically making a tbl_df() object, so we will have to do so manually.
iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt") %>% tbl_df()
Let’s take a look at the type of data we are working with.
It is also important in R to know what data types are being used, as this can affect the behaviour of some functions.
iwpc_data %>%
map(~class(.x)) %>%
t() %>%
as.data.frame() %>%
mutate(Variable_Name = rownames(.), Variable_Type = V1) %>%
select(Variable_Name, Variable_Type) %>%
datatable(rownames = FALSE, options = list(paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
Looking at our data above we see there are a number of problems:
Let’s first fix the column names - which is easy to do with dplyr.
iwpc_data %<>%
rename(subject_id = PharmGKB.Subject.ID,
sample_id = PharmGKB.Sample.ID,
project_site = Project.Site,
gender = Gender,
race_reported = Race..Reported.,
race_omb = Race..OMB.,
ethnicity_reported = Ethnicity..Reported.,
ethnicitiy_omb = Ethnicity..OMB.,
age = Age,
height = Height..cm.,
weight = Weight..kg.,
indication = Indication.for.Warfarin.Treatment,
comorbidities = Comorbidities,
medications = Medications,
target_inr = Target.INR,
target_inr_estimated = Estimated.Target.INR.Range.Based.on.Indication,
reached_stable_dose = Subject.Reached.Stable.Dose.of.Warfarin,
therapeutic_warfarin_dose = Therapeutic.Dose.of.Warfarin,
inr_on_warfarin = INR.on.Reported.Therapeutic.Dose.of.Warfarin,
smoker = Current.Smoker,
cyp2c9_consensus = CYP2C9.consensus,
vkorc1_1639_consensus = VKORC1..1639.consensus)
To fix the Target INR problem, we will need to use some basic string functions from stringr. First though let’s get a good look at the extent of the problem by looking at all the distinct values in this field.
iwpc_data %>%
count(target_inr_estimated) %>%
datatable(rownames = FALSE, colnames = c("Target INR", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
As of the writing of this tutorial there is not a conditional mutate function in dplyr (though it’s being discussed by the development team). Because of that, we must make our own conditionals with ifelse().
iwpc_data %<>%
mutate(target_inr_estimated = as.character(target_inr_estimated)) %>%
mutate(target_inr_estimated = ifelse(target_inr_estimated == "3-Feb",
yes = "2-3",
no = ifelse(target_inr_estimated == "4-Mar",
yes = "3-4",
no = target_inr_estimated)))
And then checking our work:
iwpc_data %>%
count(target_inr_estimated) %>%
datatable(rownames = FALSE, colnames = c("Target INR", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
In this case since we are trying to replicate an existing analysis plan we have the easy job of just trying to massage our data into the variables used the previous model. In this case they used a lot of dummy coded variables in their analysis.
They Used:
We Have:
Based on this it looks like we will need to:
Again, always look at your data to make sure it follows the formatting you’re expecting.
iwpc_data %>%
count(age) %>%
datatable(rownames = FALSE, colnames = c("Age", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
iwpc_data %<>%
mutate(age = as.character(age)) %>%
mutate(age = ifelse(age == "19-Oct",
yes = "10 - 19", no = age))
Confirm it’s fixed:
Now to the fun stuff. In this case we really only need the first number of the string. This is not the safest transformation, but it makes the processing easier and code less complex than coding ifelse() statements for each case. I always like to look at the impact of my work before confirming the change.
iwpc_data %>%
count(age,
substr(age,1,1),
as.numeric(substr(age,1,1))) %>%
datatable(rownames = FALSE, colnames = c("Age", "Substring of Age", "Numeric Version of Substring", "N"), options = list(order = list(3, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
iwpc_data %<>%
mutate(age_decades = as.numeric(substr(age,1,1)))
#### Dummy Code VKORC1 Genotypes
Again first look at the genotypes.
iwpc_data %>%
count(vkorc1_1639_consensus) %>%
datatable(rownames = FALSE, colnames = c("VKORC1 Genotype", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
iwpc_data %>%
mutate(vkorc1_1639_ag = ifelse(str_detect(vkorc1_1639_consensus,"A/G"),
yes = 1, no = 0),
vkorc1_1639_aa = ifelse(str_detect(vkorc1_1639_consensus, "A/A"),
yes = 1,no = 0),
vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
yes = 1,no = 0)) %>%
count(vkorc1_1639_consensus,vkorc1_1639_ag,vkorc1_1639_aa,vkorc1_1639_unknown) %>%
datatable(colnames = c("VKORC1 1639","VKORC1 A/G","VKORC1 A/A","VKORC1 Unknown","N"), rownames = FALSE, options = list(pageLength = 12, bFilter = FALSE, info = FALSE, paging = FALSE))
Wait! I thought we set the values for VKORC1 A/G and VKORC1 A/A to 0 if it didn’t match the regular expression! Why are those fields blank when the VKORC1 genotype was missing?
Well that has to do with how R handles NA values. If you don’t know what the value for that field is, R has no idea if the regex matches. Hypothetically that person could match that genotype we just don’t know. Because of this we have to adjust our approach. Now instead of defining you by what you are, we’ll group you by what you are not: If you are not NA and you are not equal to A/G we set you to 0 otherwise (e.g., you equal A/G) we set you to one.
iwpc_data %<>%
mutate(vkorc1_1639_ag = ifelse(is.na(vkorc1_1639_consensus) |
!str_detect(vkorc1_1639_consensus,"A/G"),
yes = 0, no = 1),
vkorc1_1639_aa = ifelse(is.na(vkorc1_1639_consensus) |
!str_detect(vkorc1_1639_consensus, "A/A"),
yes = 0, no = 1),
vkorc1_1639_unknown = ifelse(is.na(vkorc1_1639_consensus),
yes = 1, no = 0))
And checking our work:
Based on the same logic as VKORC1, we will use the exclusionary strategy for dummy coding CYP2C9 genotype. Note that here we have a lot more genotypes than we use in the model.
iwpc_data %>%
count(cyp2c9_consensus) %>%
datatable(rownames = FALSE, colnames = c("CYP2C9 Genotype", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
iwpc_data %<>%
mutate(cyp2c9_1_2 = ifelse(is.na(cyp2c9_consensus) |
!str_detect(cyp2c9_consensus,"\\*1/\\*2"),
yes = 0, no = 1),
cyp2c9_1_3 = ifelse(is.na(cyp2c9_consensus) |
!str_detect(cyp2c9_consensus,"\\*1/\\*3"),
yes = 0, no = 1),
cyp2c9_2_2 = ifelse(is.na(cyp2c9_consensus) |
!str_detect(cyp2c9_consensus,"\\*2/\\*2"),
yes = 0, no = 1),
cyp2c9_2_3 = ifelse(is.na(cyp2c9_consensus) |
!str_detect(cyp2c9_consensus,"\\*2/\\*3"),
yes = 0, no = 1),
cyp2c9_3_3 = ifelse(is.na(cyp2c9_consensus) |
!str_detect(cyp2c9_consensus,"\\*3/\\*3"),
yes = 0, no = 1),
cyp2c9_unknown = ifelse(is.na(cyp2c9_consensus),
yes = 1,no = 0))
Checking our work:
iwpc_data %>%
count(race_omb) %>%
datatable(rownames = FALSE, colnames = c("Race", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
This is clean and easy to fix with mutate and ifelse().
iwpc_data %<>%
mutate(asian = ifelse(str_detect(race_omb, "Asian"),
yes = 1,
no = 0),
african_american = ifelse(str_detect(race_omb, "Black or African American"),
yes = 1,
no = 0),
missing_or_mixed_race = ifelse(str_detect(race_omb, "Unknown"),
yes = 1,
no = 0))
Checking our work:
iwpc_data %>%
count(race_omb, asian, african_american, missing_or_mixed_race) %>%
datatable(colnames = c("Race OMB","Asian","African American","Missing/Mixed Race","N"), rownames = FALSE, options = list(pageLength = 12, bFilter = FALSE, info = FALSE, paging = FALSE))
The medications column is an ugly beast. Remember how this data came from multiple studies? Well that means this field has a lot going on. Some sites asked about specific drugs and only include if they took those drugs or not. Others are from EHR linked databases and they simply exported the medication list of the patient (don’t even begin to ask which medication list - timepoint, copy/paste hold overs etc.!).
Let’s take a quick look at the format to see what we’re getting ourselves into.
iwpc_data %>%
count(medications) %>%
datatable(rownames = FALSE, colnames = c("Medications", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE, scrollY = '300px'))
Let’s filter the medications to look at those matching amiodarone
iwpc_data %>%
filter(str_detect(medications, "amiodarone")) %>%
count(medications) %>%
datatable(rownames = FALSE, colnames = c("Medications", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE, scrollY = '300px'))
From this look we know that there are a lot of qualifiers and other textual clues. A really great trick for regex development is to use str_extract to get just the snippet of text mentioning amiodarone. Before we get there though, always make sure your more complex regex is grabbing the same number of rows as your general regex.
iwpc_data %>% filter(str_detect(medications, "amiodarone")) %>% count()
## Source: local data frame [1 x 1]
##
## n
## (int)
## 1 1160
iwpc_data %>% filter(str_detect(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)")) %>% count()
## Source: local data frame [1 x 1]
##
## n
## (int)
## 1 1160
Then look at the text snippet:
iwpc_data %>%
mutate(amiodarone_text = str_extract(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)")) %>%
count(amiodarone_text) %>%
datatable(rownames = FALSE, colnames = c("Amiodarone_Snippet", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
Now let’s write a regex to extract amiodarone only where it does not say, “not” or “no” amiodarone.
iwpc_data %>%
mutate(amiodarone_text = str_extract(medications, "(^|;)[a-z ]*amiodarone[a-z ]*($|;)"),
amiodarone_bool = ifelse( !is.na(medications) & str_detect(medications, "(?<!not? )amiodarone"),
yes = 1,
no = 0)) %>%
count(amiodarone_text, amiodarone_bool) %>%
datatable(rownames = FALSE, colnames = c("Amiodarone_Snippet", "Amiodarone_Detector", "N"), options = list(order = list(1, "dsc"), paging = FALSE, bFilter = FALSE, info = FALSE), extensions = 'FixedHeader')
Great! Our regex works let’s implement it!
iwpc_data %<>%
mutate(amiodarone = ifelse( !is.na(medications) & str_detect(medications, "(?<!not? )amiodarone"),
yes = 1,
no = 0))
I will leave it an excercise for you to go through the regex development process with these drugs, but here are the finished regexes.
iwpc_data %<>%
mutate(carbamazepine = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )carbamazepine"), yes = 1, no = 0),
phenytoin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )phenytoin"),yes = 1,no = 0),
rifampin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )rifampin"),yes = 1,no = 0),
rifampicin = ifelse(!is.na(medications) & str_detect(medications,"(?<!not )rifampicin"),yes = 1,no = 0))
Remember though, we only need enzyme inducer status - i.e. did the patient take any of these drugs? Thankfully an easy way to create this variable is just take add up the four medications columns - if it is greater than 1 they took at least one of the medications!
iwpc_data %<>%
mutate(enzyme_inducers = ifelse((carbamazepine + phenytoin + rifampin + rifampicin) > 0, yes = 1, no = 0))
Checking our data:
Phew! Now that our data cleaning is finished, let’s get down to the fun - modeling our data!
I’m a big proponent of visualizing your data to make sure there’s nothing wonky happening. Let’s take a look at the outcome of interest: stable warfarin dose.
iwpc_data %>%
ggplot(aes(x = 1, y = therapeutic_warfarin_dose)) + geom_boxplot()
Oh, that’s not pretty. In fact it is common for warfarin dose to use the sqrt of the final dose.
iwpc_data %>%
ggplot(aes(x = 1, y = sqrt(therapeutic_warfarin_dose))) + geom_boxplot()
Let’s make a transformed outcome variable that is the square root of the therapeutic dose.
iwpc_data %<>% mutate(sqrt_warfarin_dose = sqrt(therapeutic_warfarin_dose))
We can use the lm() function to run a linear model
iwpc_data %>%
lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers)
##
## Call:
## lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag +
## vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 +
## cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + asian + african_american +
## missing_or_mixed_race + amiodarone + enzyme_inducers, data = .)
##
## Coefficients:
## (Intercept) age_decades vkorc1_1639_ag
## 8.28093 -0.28379 -0.80007
## vkorc1_1639_aa vkorc1_1639_unknown cyp2c9_1_2
## -1.56451 -0.59583 -0.46530
## cyp2c9_1_3 cyp2c9_2_2 cyp2c9_2_3
## -0.89206 -1.08356 -1.86263
## cyp2c9_3_3 asian african_american
## -2.49990 -0.68607 -0.07697
## missing_or_mixed_race amiodarone enzyme_inducers
## -0.34162 -0.66829 0.53980
iwpc_data %>%
lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers) %>%
summary()
##
## Call:
## lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag +
## vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 +
## cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + asian + african_american +
## missing_or_mixed_race + amiodarone + enzyme_inducers, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9065 -0.6976 -0.0184 0.6260 11.4025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28093 0.07207 114.901 < 2e-16 ***
## age_decades -0.28379 0.01027 -27.642 < 2e-16 ***
## vkorc1_1639_ag -0.80007 0.04496 -17.797 < 2e-16 ***
## vkorc1_1639_aa -1.56451 0.05441 -28.754 < 2e-16 ***
## vkorc1_1639_unknown -0.59583 0.04434 -13.437 < 2e-16 ***
## cyp2c9_1_2 -0.46530 0.04654 -9.999 < 2e-16 ***
## cyp2c9_1_3 -0.89206 0.05349 -16.677 < 2e-16 ***
## cyp2c9_2_2 -1.08356 0.15039 -7.205 6.58e-13 ***
## cyp2c9_2_3 -1.86263 0.13755 -13.542 < 2e-16 ***
## cyp2c9_3_3 -2.49990 0.24519 -10.196 < 2e-16 ***
## asian -0.68607 0.04462 -15.376 < 2e-16 ***
## african_american -0.07697 0.05797 -1.328 0.1844
## missing_or_mixed_race -0.34162 0.05457 -6.260 4.13e-10 ***
## amiodarone -0.66829 0.08113 -8.237 < 2e-16 ***
## enzyme_inducers 0.53980 0.22437 2.406 0.0162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.092 on 5474 degrees of freedom
## (211 observations deleted due to missingness)
## Multiple R-squared: 0.4027, Adjusted R-squared: 0.4012
## F-statistic: 263.7 on 14 and 5474 DF, p-value: < 2.2e-16
So if we were running multiple models or trying to do any processing of these models it would be really annoying to work with the above output. In fact R does not even store the p-value for each coefficient. Those are all calculated on fly when you view the summary of the model. Thankfully David Rob made a package called Broom that fixes this beautifully!
model <- iwpc_data %>% lm(formula = sqrt_warfarin_dose ~ age_decades + vkorc1_1639_ag + vkorc1_1639_aa + vkorc1_1639_unknown + cyp2c9_1_2 + cyp2c9_1_3 + cyp2c9_2_2 + cyp2c9_2_3 + cyp2c9_3_3 + asian + african_american + missing_or_mixed_race + amiodarone + enzyme_inducers)
warfarin_pharmacogenomic_model <- tidy(model)
warfarin_pharmacogenomic_model
We can also see the overall model fit information in a clean dataframe:
glance(model) %>%
datatable(options = list(paging = FALSE, bFilter = FALSE, info = FALSE, scrollX = TRUE, columnDefs = list(list(className = "dt-center", targets = c(0:4)))))
Let’s make a Shiny Application that takes inputs, uses the model weights we developed and gives out a predicted warfarin dose for the visitor:
shinyApp(
ui = pageWithSidebar(
headerPanel("Warfarin Pharmagenomic Dose Predictor"),
sidebarPanel(selectInput("age", "Age in Decades:", choices = c("10-19","20-29","30-39", "40-49", "50-59", "60-69", "70-79", "80-89", "90+")),
radioButtons("race", "Race:", choices = c( "Other or Unknown", "Asian", "African American", "White")),
radioButtons("vkorc1", "VKORC1 Genotype:", choices = c("Unknown", "G/G", "G/A", "A/A")),
radioButtons("cyp2c9", "CYP2C9 Genotype:", choices = c("Unknown", "*1/*2", "*1/*3", "*2/*2", "*2/*3", "*3/*3","Other")),
checkboxInput('amiodarone', "Taking Amiodarone", FALSE),
checkboxInput("enzyme_inducers", "Taking an Enzyme Inducer (rifampin, carbamazepine, phenytoin or rifampicin)", FALSE),
actionButton("calc","Calculate")
),
mainPanel(
strong(em("THIS IS A PROGRAMMING EXAMPLE ONLY - DO NOT USE FOR PATIENT CARE!")),
br(),br(),
strong(em("IF YOU HAVE QUESTIONS ABOUT YOUR WARFARIN DOSE, PLEASE CONTACT YOUR DOCTOR.")),
tableOutput("selectedvalues"),
textOutput("warfarindose")
)
),
server = function(input, output){
library(dplyr)
input_model <- eventReactive(input$calc, {
data.frame(Age = input$age,
Race = input$race,
VKORC1 = input$vkorc1,
CYP2C9 = input$cyp2c9,
On_Amiodarone = input$amiodarone,
On_Enzyme_Inducers = input$enzyme_inducers)
})
output$selectedvalues <- renderTable({input_model() %>%
gather(key = Variable, value = Selection)})
output$warfarindose <- renderText({
warfarin_model <- structure(list(term = c("intercept", "age_decades", "vkorc1_1639_ag", "vkorc1_1639_aa", "vkorc1_1639_unknown", "cyp2c9_1_2", "cyp2c9_1_3", "cyp2c9_2_2", "cyp2c9_2_3", "cyp2c9_3_3", "asian", "african_american", "missing_or_mixed_race", "amiodarone", "enzyme_inducers"), estimate = c(8.28092953788836, -0.283786900016807, -0.800068515352432, -1.56451454024173, -0.595825280765066, -0.465304224113711, -0.892056417892715, -1.08355935593238, -1.86263474384314, -2.49990281796991, -0.686065650288828, -0.0769652407947399, -0.341619674379834, -0.668291885082912, 0.539804214460087)), row.names = c(NA, -15L), class = "data.frame", .Names = c("term", "estimate"))
predicted_dose <- input_model() %>%
mutate(intercept = 1,
age_decades = as.numeric(substr(Age, 1, 1)),
vkorc1_1639_ag = ifelse(VKORC1 == "G/A", 1, 0),
vkorc1_1639_aa = ifelse(VKORC1 == "A/A", 1, 0),
vkorc1_1639_unknown = ifelse(VKORC1 == "Unknown", 1, 0),
cyp2c9_1_2 = ifelse(CYP2C9 == "*1/*2", 1, 0),
cyp2c9_1_3 = ifelse(CYP2C9 == "*1/*3", 1, 0),
cyp2c9_2_2 = ifelse(CYP2C9 == "*2/*2", 1, 0),
cyp2c9_2_3 = ifelse(CYP2C9 == "*2/*3", 1, 0),
cyp2c9_3_3 = ifelse(CYP2C9 == "*3/*3", 1, 0),
asian = ifelse(Race == "Asian", 1, 0),
african_american = ifelse(Race == "African American", 1, 0),
missing_or_mixed_race = ifelse(Race == "Other or Unknown", 1, 0),
amiodarone = ifelse(On_Amiodarone, 1, 0),
enzyme_inducers = ifelse(On_Enzyme_Inducers, 1, 0)) %>%
select(-c(Age:On_Enzyme_Inducers)) %>%
gather(key = term, value = value) %>%
mutate(term = as.character(term)) %>%
inner_join(warfarin_model) %>%
mutate(weighted = value * estimate) %>%
summarise(round(sum(weighted)^2))
paste("Based on the above values, the predicted warfarin dose is: ", predicted_dose, "mg per week")
})
},
options = list(height = 1000)
)